Data Hackathon 3.x Data Preparation

Load Libraries:



In [9]:

    
import pandas as pd
import numpy as np
%matplotlib inline



In [10]:

    
#Load data:
data_path = '/Users/wy/notebook/dataset/Parameter_Tuning_XGBoost_with_Example/'
train = pd.read_csv(data_path+'Train_nyOWmfK.csv')
test = pd.read_csv(data_path+'Test_bCtAN1w.csv')



In [11]:

    
train.shape, test.shape









    Out[11]:





((87020, 26), (37717, 24))



In [13]:

    
train.dtypes









    Out[13]:





ID                        object
Gender                    object
City                      object
Monthly_Income             int64
DOB                       object
Lead_Creation_Date        object
Loan_Amount_Applied      float64
Loan_Tenure_Applied      float64
Existing_EMI             float64
Employer_Name             object
Salary_Account            object
Mobile_Verified           object
Var5                       int64
Var1                      object
Loan_Amount_Submitted    float64
Loan_Tenure_Submitted    float64
Interest_Rate            float64
Processing_Fee           float64
EMI_Loan_Submitted       float64
Filled_Form               object
Device_Type               object
Var2                      object
Source                    object
Var4                       int64
LoggedIn                   int64
Disbursed                  int64
dtype: object



In [14]:

    
#Combine into data:
train['source']= 'train'
test['source'] = 'test'
data=pd.concat([train, test],ignore_index=True)
data.shape









    Out[14]:





(124737, 27)

Check missing:



In [15]:

    
data.apply(lambda x: sum(x.isnull()))









    Out[15]:





City                      1401
DOB                          0
Device_Type                  0
Disbursed                37717
EMI_Loan_Submitted       84901
Employer_Name              113
Existing_EMI               111
Filled_Form                  0
Gender                       0
ID                           0
Interest_Rate            84901
Lead_Creation_Date           0
Loan_Amount_Applied        111
Loan_Amount_Submitted    49535
Loan_Tenure_Applied        111
Loan_Tenure_Submitted    49535
LoggedIn                 37717
Mobile_Verified              0
Monthly_Income               0
Processing_Fee           85346
Salary_Account           16801
Source                       0
Var1                         0
Var2                         0
Var4                         0
Var5                         0
source                       0
dtype: int64

Look at categories of all object variables:



In [16]:

    
var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']
for v in var:
    print '\nFrequency count for variable %s'%v
    print data[v].value_counts()









    



Frequency count for variable Gender
Male      71398
Female    53339
Name: Gender, dtype: int64

Frequency count for variable Salary_Account
HDFC Bank                                          25180
ICICI Bank                                         19547
State Bank of India                                17110
Axis Bank                                          12590
Citibank                                            3398
Kotak Bank                                          2955
IDBI Bank                                           2213
Punjab National Bank                                1747
Bank of India                                       1713
Bank of Baroda                                      1675
Standard Chartered Bank                             1434
Canara Bank                                         1385
Union Bank of India                                 1330
Yes Bank                                            1120
ING Vysya                                            996
Corporation bank                                     948
Indian Overseas Bank                                 901
State Bank of Hyderabad                              854
Indian Bank                                          773
Oriental Bank of Commerce                            761
IndusInd Bank                                        711
Andhra Bank                                          706
Central Bank of India                                648
Syndicate Bank                                       614
Bank of Maharasthra                                  576
HSBC                                                 474
State Bank of Bikaner & Jaipur                       448
Karur Vysya Bank                                     435
State Bank of Mysore                                 385
Federal Bank                                         377
Vijaya Bank                                          354
Allahabad Bank                                       345
UCO Bank                                             344
State Bank of Travancore                             333
Karnataka Bank                                       279
United Bank of India                                 276
Dena Bank                                            268
Saraswat Bank                                        265
State Bank of Patiala                                263
South Indian Bank                                    223
Deutsche Bank                                        176
Abhyuday Co-op Bank Ltd                              161
The Ratnakar Bank Ltd                                113
Tamil Nadu Mercantile Bank                           103
Punjab & Sind bank                                    84
J&K Bank                                              78
Lakshmi Vilas bank                                    69
Dhanalakshmi Bank Ltd                                 66
State Bank of Indore                                  32
Catholic Syrian Bank                                  27
India Bulls                                           21
B N P Paribas                                         15
Firstrand Bank Limited                                11
GIC Housing Finance Ltd                               10
Bank of Rajasthan                                      8
Kerala Gramin Bank                                     4
Industrial And Commercial Bank Of China Limited        3
Ahmedabad Mercantile Cooperative Bank                  1
Name: Salary_Account, dtype: int64

Frequency count for variable Mobile_Verified
Y    80928
N    43809
Name: Mobile_Verified, dtype: int64

Frequency count for variable Var1
HBXX    84901
HBXC    12952
HBXB     6502
HAXA     4214
HBXA     3042
HAXB     2879
HBXD     2818
HAXC     2171
HBXH     1387
HCXF      990
HAYT      710
HAVC      570
HAXM      386
HCXD      348
HCYS      318
HVYS      252
HAZD      161
HCXG      114
HAXF       22
Name: Var1, dtype: int64

Frequency count for variable Filled_Form
N    96740
Y    27997
Name: Filled_Form, dtype: int64

Frequency count for variable Device_Type
Web-browser    92105
Mobile         32632
Name: Device_Type, dtype: int64

Frequency count for variable Var2
B    53481
G    47338
C    20366
E     1855
D      918
F      770
A        9
Name: Var2, dtype: int64

Frequency count for variable Source
S122    55249
S133    42900
S159     7999
S143     6140
S127     2804
S137     2450
S134     1900
S161     1109
S151     1018
S157      929
S153      705
S144      447
S156      432
S158      294
S123      112
S141       83
S162       60
S124       43
S150       19
S160       11
S136        5
S138        5
S155        5
S139        4
S129        4
S135        2
S131        1
S130        1
S132        1
S125        1
S140        1
S142        1
S126        1
S154        1
Name: Source, dtype: int64



In [19]:

    
data['Gender'].value_counts()









    Out[19]:





Male      71398
Female    53339
Name: Gender, dtype: int64

Handle Individual Variables:

City Variable:



In [22]:

    
print len(data['City'].unique())
#drop city because too many unique
data.drop('City',axis=1,inplace=True)

Determine Age from DOB



In [23]:

    
data['DOB'].head()









    Out[23]:





0    23-May-78
1    07-Oct-85
2    10-Oct-81
3    30-Nov-87
4    17-Feb-84
Name: DOB, dtype: object



In [27]:

    
#Create age variable:
data['Age'] = data['DOB'].apply(lambda x: 115 - int(x[-2:]))
data['Age'].head()









    Out[27]:





0    37
1    30
2    34
3    28
4    31
Name: Age, dtype: int64



In [28]:

    
#drop DOB:
data.drop('DOB',axis=1,inplace=True)

EMI_Load_Submitted



In [30]:

    
data.boxplot(column=['EMI_Loan_Submitted'],return_type='axes')









    Out[30]:





<matplotlib.axes._subplots.AxesSubplot at 0x109634650>



In [31]:

    
#Majority values missing so I'll create a new variable stating whether this is missing or note:
data['EMI_Loan_Submitted_Missing'] = data['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data[['EMI_Loan_Submitted','EMI_Loan_Submitted_Missing']].head(5)









    Out[31]:






  
    
      
      EMI_Loan_Submitted
      EMI_Loan_Submitted_Missing
    
  
  
    
      0
      NaN
      1
    
    
      1
      6762.9
      0
    
    
      2
      NaN
      1
    
    
      3
      NaN
      1
    
    
      4
      NaN
      1



In [32]:

    
#drop original vaiables:
data.drop('EMI_Loan_Submitted',axis=1,inplace=True)

Employer Name



In [33]:

    
len(data['Employer_Name'].value_counts())









    Out[33]:





57193



In [34]:

    
#I'll drop the variable because too many unique values. Another option could be to categorize them manually
data.drop('Employer_Name',axis=1,inplace=True)

Existing EMI



In [35]:

    
data.boxplot(column='Existing_EMI',return_type='axes')









    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x109634d10>



In [36]:

    
data['Existing_EMI'].describe()









    Out[36]:





count    1.246260e+05
mean     3.636342e+03
std      3.369124e+04
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      3.500000e+03
max      1.000000e+07
Name: Existing_EMI, dtype: float64



In [37]:

    
#Impute by median (0) because just 111 missing:
data['Existing_EMI'].fillna(0, inplace=True)

Interest Rate:



In [39]:

    
#Majority values missing so I'll create a new variable stating whether this is missing or note:
data['Interest_Rate_Missing'] = data['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
print data[['Interest_Rate','Interest_Rate_Missing']].head(10)









    



   Interest_Rate  Interest_Rate_Missing
0            NaN                      1
1          13.25                      0
2            NaN                      1
3            NaN                      1
4            NaN                      1
5          13.99                      0
6            NaN                      1
7            NaN                      1
8          14.85                      0
9          18.25                      0



In [40]:

    
data.drop('Interest_Rate',axis=1,inplace=True)

Lead Creation Date:



In [41]:

    
#Drop this variable because doesn't appear to affect much intuitively
data.drop('Lead_Creation_Date',axis=1,inplace=True)

Loan Amount and Tenure applied:



In [42]:

    
#Impute with median because only 111 missing:
data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median(),inplace=True)
data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median(),inplace=True)

Loan Amount and Tenure selected



In [43]:

    
#High proportion missing so create a new var whether present or not
data['Loan_Amount_Submitted_Missing'] = data['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data['Loan_Tenure_Submitted_Missing'] = data['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)



In [44]:

    
#Remove old vars
data.drop(['Loan_Amount_Submitted','Loan_Tenure_Submitted'],axis=1,inplace=True)

Remove logged-in



In [45]:

    
data.drop('LoggedIn',axis=1,inplace=True)

Remove salary account



In [46]:

    
#Salary account has mnay banks which have to be manually grouped
data.drop('Salary_Account',axis=1,inplace=True)

Processing_Fee



In [48]:

    
#High proportion missing so create a new var whether present or not
data['Processing_Fee_Missing'] = data['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
#drop old
data.drop('Processing_Fee',axis=1,inplace=True)

Source



In [49]:

    
data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122','S133'] else x)
data['Source'].value_counts()









    Out[49]:





S122      55249
S133      42900
others    26588
Name: Source, dtype: int64

Final Data:



In [50]:

    
data.apply(lambda x: sum(x.isnull()))









    Out[50]:





Device_Type                          0
Disbursed                        37717
Existing_EMI                         0
Filled_Form                          0
Gender                               0
ID                                   0
Loan_Amount_Applied                  0
Loan_Tenure_Applied                  0
Mobile_Verified                      0
Monthly_Income                       0
Source                               0
Var1                                 0
Var2                                 0
Var4                                 0
Var5                                 0
source                               0
Age                                  0
EMI_Loan_Submitted_Missing           0
Interest_Rate_Missing                0
Loan_Amount_Submitted_Missing        0
Loan_Tenure_Submitted_Missing        0
Processing_Fee_Missing               0
dtype: int64



In [51]:

    
data.dtypes









    Out[51]:





Device_Type                       object
Disbursed                        float64
Existing_EMI                     float64
Filled_Form                       object
Gender                            object
ID                                object
Loan_Amount_Applied              float64
Loan_Tenure_Applied              float64
Mobile_Verified                   object
Monthly_Income                     int64
Source                            object
Var1                              object
Var2                              object
Var4                               int64
Var5                               int64
source                            object
Age                                int64
EMI_Loan_Submitted_Missing         int64
Interest_Rate_Missing              int64
Loan_Amount_Submitted_Missing      int64
Loan_Tenure_Submitted_Missing      int64
Processing_Fee_Missing             int64
dtype: object

Numerical Coding:



In [53]:

    
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
var_to_encode = ['Device_Type','Filled_Form','Gender','Var1','Var2','Mobile_Verified','Source']
for col in var_to_encode:
    data[col] = le.fit_transform(data[col])

One-Hot Coding



In [55]:

    
data = pd.get_dummies(data, columns=var_to_encode)
data.columns









    Out[55]:





Index([u'Disbursed', u'Existing_EMI', u'ID', u'Loan_Amount_Applied',
       u'Loan_Tenure_Applied', u'Monthly_Income', u'Var4', u'Var5', u'source',
       u'Age', u'EMI_Loan_Submitted_Missing', u'Interest_Rate_Missing',
       u'Loan_Amount_Submitted_Missing', u'Loan_Tenure_Submitted_Missing',
       u'Processing_Fee_Missing', u'Device_Type_0', u'Device_Type_1',
       u'Filled_Form_0', u'Filled_Form_1', u'Gender_0', u'Gender_1', u'Var1_0',
       u'Var1_1', u'Var1_2', u'Var1_3', u'Var1_4', u'Var1_5', u'Var1_6',
       u'Var1_7', u'Var1_8', u'Var1_9', u'Var1_10', u'Var1_11', u'Var1_12',
       u'Var1_13', u'Var1_14', u'Var1_15', u'Var1_16', u'Var1_17', u'Var1_18',
       u'Var2_0', u'Var2_1', u'Var2_2', u'Var2_3', u'Var2_4', u'Var2_5',
       u'Var2_6', u'Mobile_Verified_0', u'Mobile_Verified_1', u'Source_0',
       u'Source_1', u'Source_2'],
      dtype='object')



In [58]:

    
data.shape









    Out[58]:





(124737, 52)

Separate train & test:



In [60]:

    
train = data.loc[data['source']=='train']
test = data.loc[data['source']=='test']



In [61]:

    
train.drop('source',axis=1,inplace=True)
test.drop(['source','Disbursed'],axis=1,inplace=True)









    



/Users/wy/anaconda/envs/condapy2.7/lib/python2.7/site-packages/IPython/kernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/Users/wy/anaconda/envs/condapy2.7/lib/python2.7/site-packages/IPython/kernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from IPython.kernel.zmq import kernelapp as app



In [62]:

    
train.to_csv(data_path+'train_modified.csv',index=False)
test.to_csv(data_path+'test_modified.csv',index=False)

reference

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

data_preparation.ipynb (Github)